The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.
Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas
You as a Data scientist at Thera bank need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards
You need to identify the best possible model that will give the required performance
Explore and visualize the dataset. Build a classification model to predict if the customer is going to churn or not Optimize the model using appropriate techniques Generate a set of insights and recommendations that will help the bank
# To help with reading and manipulation of data
import numpy as np
import pandas as pd
# To help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To split the data
from sklearn.model_selection import train_test_split
# To impute missing values
from sklearn.impute import SimpleImputer
# To build a Random forest classifier
from sklearn.ensemble import RandomForestClassifier
# To tune a model
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
# To get different performance metrics
import sklearn.metrics as metrics
from sklearn.metrics import (
classification_report,
confusion_matrix,
recall_score,
accuracy_score,
precision_score,
f1_score,
)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv("BankChurners.csv")
data=df.copy()
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
# checking missing values in the data
data.isna().sum()
CLIENTNUM 0 Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
# let's check for missing values in the data
round(data.isnull().sum() / data.isnull().count() * 100, 2)
CLIENTNUM 0.0 Attrition_Flag 0.0 Customer_Age 0.0 Gender 0.0 Dependent_count 0.0 Education_Level 15.0 Marital_Status 7.4 Income_Category 0.0 Card_Category 0.0 Months_on_book 0.0 Total_Relationship_Count 0.0 Months_Inactive_12_mon 0.0 Contacts_Count_12_mon 0.0 Credit_Limit 0.0 Total_Revolving_Bal 0.0 Avg_Open_To_Buy 0.0 Total_Amt_Chng_Q4_Q1 0.0 Total_Trans_Amt 0.0 Total_Trans_Ct 0.0 Total_Ct_Chng_Q4_Q1 0.0 Avg_Utilization_Ratio 0.0 dtype: float64
# Making a list of all categorical variables
cat_col = [
"Attrition_Flag",
"Dependent_count",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category",
"Gender"
]
# Printing number of count of each unique value in each column
for column in cat_col:
print(data[column].value_counts())
print("-" * 40)
Existing Customer 8500 Attrited Customer 1627 Name: Attrition_Flag, dtype: int64 ---------------------------------------- 3 2732 2 2655 1 1838 4 1574 0 904 5 424 Name: Dependent_count, dtype: int64 ---------------------------------------- Graduate 3128 High School 2013 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64 ---------------------------------------- Married 4687 Single 3943 Divorced 748 Name: Marital_Status, dtype: int64 ---------------------------------------- Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: Income_Category, dtype: int64 ---------------------------------------- Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card_Category, dtype: int64 ---------------------------------------- F 5358 M 4769 Name: Gender, dtype: int64 ----------------------------------------
for col in cat_col:
data[col] = data[col].astype("category")
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null category 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null category 4 Dependent_count 10127 non-null category 5 Education_Level 8608 non-null category 6 Marital_Status 9378 non-null category 7 Income_Category 10127 non-null category 8 Card_Category 10127 non-null category 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: category(7), float64(5), int64(9) memory usage: 1.2 MB
data['Attrition_Flag'].unique()
['Existing Customer', 'Attrited Customer'] Categories (2, object): ['Attrited Customer', 'Existing Customer']
# checking the distribution of the target variable
data["Attrition_Flag"] = data['Attrition_Flag'].replace('Existing Customer', 0).replace('Attrited Customer',1).astype("int64")
data.drop(["CLIENTNUM"], axis=1, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Attrition_Flag 10127 non-null int64 1 Customer_Age 10127 non-null int64 2 Gender 10127 non-null category 3 Dependent_count 10127 non-null category 4 Education_Level 8608 non-null category 5 Marital_Status 9378 non-null category 6 Income_Category 10127 non-null category 7 Card_Category 10127 non-null category 8 Months_on_book 10127 non-null int64 9 Total_Relationship_Count 10127 non-null int64 10 Months_Inactive_12_mon 10127 non-null int64 11 Contacts_Count_12_mon 10127 non-null int64 12 Credit_Limit 10127 non-null float64 13 Total_Revolving_Bal 10127 non-null int64 14 Avg_Open_To_Buy 10127 non-null float64 15 Total_Amt_Chng_Q4_Q1 10127 non-null float64 16 Total_Trans_Amt 10127 non-null int64 17 Total_Trans_Ct 10127 non-null int64 18 Total_Ct_Chng_Q4_Q1 10127 non-null float64 19 Avg_Utilization_Ratio 10127 non-null float64 dtypes: category(6), float64(5), int64(9) memory usage: 1.1 MB
int_cols=["Credit_Limit","Avg_Open_To_Buy"]
for col in int_cols:
data[col] = data[col].astype("int64")
from numpy import nan
# print(data[data["Education_Level"]==nan].count())
data["Education_Level"].unique()
['High School', 'Graduate', 'Uneducated', NaN, 'College', 'Post-Graduate', 'Doctorate'] Categories (6, object): ['College', 'Doctorate', 'Graduate', 'High School', 'Post-Graduate', 'Uneducated']
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
histogram_boxplot(data, "Customer_Age")
histogram_boxplot(data, "Credit_Limit")
histogram_boxplot(data, "Total_Trans_Amt")
histogram_boxplot(data, "Total_Ct_Chng_Q4_Q1")
histogram_boxplot(df, "Dependent_count")
histogram_boxplot(data, "Total_Revolving_Bal")
histogram_boxplot(df, "Total_Trans_Ct")
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
labeled_barplot(data, "Attrition_Flag")
labeled_barplot(df, "Customer_Age")
labeled_barplot(df, "Marital_Status")
labeled_barplot(df, "Income_Category")
labeled_barplot(df, "Education_Level")
labeled_barplot(df, "Card_Category")
labeled_barplot(df, "Total_Relationship_Count")
labeled_barplot(df, "Gender")
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
sns.pairplot(data=data, hue="Attrition_Flag")
plt.show()
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(
loc="lower left", frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
stacked_barplot(data, "Gender", "Attrition_Flag")
Attrition_Flag 1 0 All Gender All 1627 8500 10127 F 930 4428 5358 M 697 4072 4769 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Customer_Age", "Attrition_Flag")
Attrition_Flag 1 0 All Customer_Age All 1627 8500 10127 43 85 388 473 48 85 387 472 44 84 416 500 46 82 408 490 45 79 407 486 49 79 416 495 47 76 403 479 41 76 303 379 50 71 381 452 54 69 238 307 40 64 297 361 42 62 364 426 53 59 328 387 52 58 318 376 51 58 340 398 55 51 228 279 39 48 285 333 38 47 256 303 56 43 219 262 59 40 117 157 37 37 223 260 57 33 190 223 58 24 133 157 36 24 197 221 35 21 163 184 33 20 107 127 34 19 127 146 32 17 89 106 61 17 76 93 62 17 76 93 30 15 55 70 31 13 78 91 60 13 114 127 65 9 92 101 63 8 57 65 29 7 49 56 26 6 72 78 64 5 38 43 27 3 29 32 28 1 28 29 66 1 1 2 68 1 1 2 67 0 4 4 70 0 1 1 73 0 1 1 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Card_Category", "Attrition_Flag")
Attrition_Flag 1 0 All Card_Category All 1627 8500 10127 Blue 1519 7917 9436 Silver 82 473 555 Gold 21 95 116 Platinum 5 15 20 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Income_Category", "Attrition_Flag")
Attrition_Flag 1 0 All Income_Category All 1627 8500 10127 Less than $40K 612 2949 3561 $40K - $60K 271 1519 1790 $80K - $120K 242 1293 1535 $60K - $80K 189 1213 1402 abc 187 925 1112 $120K + 126 601 727 ------------------------------------------------------------------------------------------------------------------------
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
distribution_plot_wrt_target(data, "Income_Category", "Attrition_Flag")
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Attrition_Flag | 10127.0 | 0.160660 | 0.367235 | 0.0 | 0.000 | 0.000 | 0.000 | 1.000 |
| Customer_Age | 10127.0 | 46.325960 | 8.016814 | 26.0 | 41.000 | 46.000 | 52.000 | 73.000 |
| Months_on_book | 10127.0 | 35.928409 | 7.986416 | 13.0 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.0 | 3.812580 | 1.554408 | 1.0 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.0 | 2.341167 | 1.010622 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.0 | 2.455317 | 1.106225 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.0 | 8631.938679 | 9088.788539 | 1438.0 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.0 | 1162.814061 | 814.987335 | 0.0 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.0 | 7469.124617 | 9090.695763 | 3.0 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | 4404.086304 | 3397.129254 | 510.0 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.0 | 64.858695 | 23.472570 | 10.0 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
data.isnull().sum()
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
# defining a list with names of columns that will be used for imputation
reqd_col_for_impute = [
"Gender",
"Dependent_count",
"Education_Level",
"Marital_Status",
"Income_Category",
"Card_Category"
]
data[reqd_col_for_impute].head()
| Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | |
|---|---|---|---|---|---|---|
| 0 | M | 3 | High School | Married | $60K - $80K | Blue |
| 1 | F | 5 | Graduate | Single | Less than $40K | Blue |
| 2 | M | 3 | Graduate | Married | $80K - $120K | Blue |
| 3 | F | 4 | High School | NaN | Less than $40K | Blue |
| 4 | M | 3 | Uneducated | Married | $60K - $80K | Blue |
cols = data.select_dtypes(include=["category"])
for i in cols.columns:
print(df[i].value_counts())
print("*" * 30)
F 5358 M 4769 Name: Gender, dtype: int64 ****************************** 3 2732 2 2655 1 1838 4 1574 0 904 5 424 Name: Dependent_count, dtype: int64 ****************************** Graduate 3128 High School 2013 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64 ****************************** Married 4687 Single 3943 Divorced 748 Name: Marital_Status, dtype: int64 ****************************** Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: Income_Category, dtype: int64 ****************************** Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card_Category, dtype: int64 ******************************
data["Income_Category"] = data['Income_Category'].replace('abc', "Unknown")
data[data['Income_Category']=='Unknown'].head()
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 0 | 45 | F | 2 | Graduate | Married | Unknown | Blue | 37 | 6 | 1 | 2 | 14470 | 1157 | 13313 | 0.966 | 1207 | 21 | 0.909 | 0.080 |
| 28 | 0 | 44 | F | 3 | Uneducated | Single | Unknown | Blue | 34 | 5 | 2 | 2 | 10100 | 0 | 10100 | 0.525 | 1052 | 18 | 1.571 | 0.000 |
| 39 | 1 | 66 | F | 0 | Doctorate | Married | Unknown | Blue | 56 | 5 | 4 | 3 | 7882 | 605 | 7277 | 1.052 | 704 | 16 | 0.143 | 0.077 |
| 44 | 0 | 38 | F | 4 | Graduate | Single | Unknown | Blue | 28 | 2 | 3 | 3 | 9830 | 2055 | 7775 | 0.977 | 1042 | 23 | 0.917 | 0.209 |
| 58 | 0 | 44 | F | 5 | Graduate | Married | Unknown | Blue | 35 | 4 | 1 | 2 | 6273 | 978 | 5295 | 2.275 | 1359 | 25 | 1.083 | 0.156 |
credit_data=data.copy()
credit_data.head()
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691 | 777 | 11914 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 0 | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256 | 864 | 7392 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 0 | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418 | 0 | 3418 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 0 | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313 | 2517 | 796 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 0 | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716 | 0 | 4716 | 2.175 | 816 | 28 | 2.500 | 0.000 |
credit_data["Education_Level"].unique()
['High School', 'Graduate', 'Uneducated', NaN, 'College', 'Post-Graduate', 'Doctorate'] Categories (6, object): ['College', 'Doctorate', 'Graduate', 'High School', 'Post-Graduate', 'Uneducated']
# we need to pass numerical values for each categorical column for KNN imputation so we will label encode them
gender = {"M": 0, "F": 1}
credit_data["Gender"] = credit_data["Gender"].map(gender)
education_level = {
"Graduate": 0,
"Uneducated": 1,
"High School": 2,
"College": 3,
"Post-Graduate": 4,
"Doctorate": 5
}
credit_data["Education_Level"] = credit_data["Education_Level"].map(education_level)
marital_Status = {
"Single": 0,
"Married": 1,
"Divorced": 2,
}
credit_data["Marital_Status"] = credit_data["Marital_Status"].map(marital_Status)
card_category = {
"Blue": 0,
"Silver": 1,
"Gold": 2,
"Platinum": 3
}
credit_data["Card_Category"] = credit_data["Card_Category"].map(card_category)
income_category = {
"Less than $40K": 0,
"$40K - $60K": 1,
"$80K - $120K": 2,
"$60K - $80K": 3,
"Unknown": 4,
"$120K +": 5,
}
credit_data["Income_Category"] = credit_data["Income_Category"].map(income_category)
from sklearn.impute import KNNImputer
X = credit_data.drop(["Attrition_Flag"], axis=1)
y = credit_data["Attrition_Flag"]
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in validation data =", X_val.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 6075 Number of rows in validation data = 2026 Number of rows in test data = 2026
# Fit and transform the train data
X_train[reqd_col_for_impute] = imputer.fit_transform(X_train[reqd_col_for_impute])
# Transform the train data
X_val[reqd_col_for_impute] = imputer.transform(X_val[reqd_col_for_impute])
# Transform the test data
X_test[reqd_col_for_impute] = imputer.transform(X_test[reqd_col_for_impute])
# Checking that no column has missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
## Function to inverse the encoding
## Function to inverse the encoding
def inverse_mapping(x, y):
inv_dict = {v: k for k, v in x.items()}
X_train[y] = np.round(X_train[y]).map(inv_dict).astype("category")
X_val[y] = np.round(X_val[y]).map(inv_dict).astype("category")
X_test[y] = np.round(X_test[y]).map(inv_dict).astype("category")
inverse_mapping(gender, "Gender")
inverse_mapping(marital_Status, "Marital_Status")
inverse_mapping(education_level, "Education_Level")
inverse_mapping(card_category, "Card_Category")
inverse_mapping(income_category, "Income_Category")
cols = X_train.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_train[i].value_counts())
print("*" * 30)
F 3193 M 2882 Name: Gender, dtype: int64 ****************************** Graduate 1854 High School 1228 Uneducated 881 College 618 Post-Graduate 312 Doctorate 254 Name: Education_Level, dtype: int64 ****************************** Married 2819 Single 2369 Divorced 430 Name: Marital_Status, dtype: int64 ****************************** Less than $40K 2129 $40K - $60K 1059 $80K - $120K 953 $60K - $80K 831 Unknown 654 $120K + 449 Name: Income_Category, dtype: int64 ****************************** Blue 5655 Silver 339 Gold 69 Platinum 12 Name: Card_Category, dtype: int64 ******************************
cols = X_val.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_val[i].value_counts())
print("*" * 30)
F 1095 M 931 Name: Gender, dtype: int64 ****************************** Graduate 623 High School 404 Uneducated 306 College 199 Post-Graduate 101 Doctorate 99 Name: Education_Level, dtype: int64 ****************************** Married 960 Single 770 Divorced 156 Name: Marital_Status, dtype: int64 ****************************** Less than $40K 736 $40K - $60K 361 $80K - $120K 293 $60K - $80K 279 Unknown 221 $120K + 136 Name: Income_Category, dtype: int64 ****************************** Blue 1905 Silver 97 Gold 21 Platinum 3 Name: Card_Category, dtype: int64 ******************************
cols = X_test.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_test[i].value_counts())
print("*" * 30)
F 1070 M 956 Name: Gender, dtype: int64 ****************************** Graduate 651 High School 381 Uneducated 300 College 196 Post-Graduate 103 Doctorate 98 Name: Education_Level, dtype: int64 ****************************** Married 908 Single 804 Divorced 162 Name: Marital_Status, dtype: int64 ****************************** Less than $40K 696 $40K - $60K 370 $60K - $80K 292 $80K - $120K 289 Unknown 237 $120K + 142 Name: Income_Category, dtype: int64 ****************************** Blue 1876 Silver 119 Gold 26 Platinum 5 Name: Card_Category, dtype: int64 ******************************
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 30) (2026, 30) (2026, 30)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
)
# To impute missing values
from sklearn.impute import KNNImputer
# To build a logistic regression model
from sklearn.linear_model import LogisticRegression
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
models = [] # Empty list to store all the models
print("\n" "LogisticRegression Performance:" "\n")
# Appending models into the list
models.append(("Logistic", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
# models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
print("\n Train performance:\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_train, model.predict(X_train)) * 100
print("{}: {}".format(name, scores))
print("\n Validation performance:\n")
for name, model in models:
model.fit(X_val, y_val)
scores = recall_score(y_val, model.predict(X_val)) * 100
print("{}: {}".format(name, scores))
print("\n Test performance:\n")
for name, model in models:
model.fit(X_test, y_test)
scores = recall_score(y_test, model.predict(X_test)) * 100
print("{}: {}".format(name, scores))
LogisticRegression Performance: Cross-Validation Performance: Logistic: 41.89952904238618 Bagging: 78.27734170591313 Random forest: 76.73992673992673 GBM: 80.93877551020408 Adaboost: 81.3469387755102 dtree: 78.99529042386185 Train performance: Logistic: 44.87704918032787 Bagging: 98.46311475409836 Random forest: 100.0 GBM: 87.80737704918032 Adaboost: 83.81147540983606 dtree: 100.0 Validation performance: Logistic: 53.68098159509203 Bagging: 97.54601226993866 Random forest: 100.0 GBM: 94.1717791411043 Adaboost: 88.65030674846625 dtree: 100.0 Test performance: Logistic: 50.153846153846146 Bagging: 96.92307692307692 Random forest: 100.0 GBM: 95.07692307692308 Adaboost: 92.3076923076923 dtree: 100.0
lr = LogisticRegression(random_state=1)
lr.fit(X_train, y_train)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_bfr = cross_val_score(
estimator=lr, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
log_reg_model_train_perf = model_performance_classification_sklearn(
lr, X_train, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.875062 | 0.44877 | 0.664643 | 0.53578 |
# Calculating different metrics on validation set
log_reg_model_val_perf = model_performance_classification_sklearn(lr, X_val, y_val)
print("Validation performance:")
log_reg_model_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.884008 | 0.503067 | 0.691983 | 0.582593 |
# Calculating different metrics on validation set
log_reg_model_test_perf = model_performance_classification_sklearn(lr, X_test, y_test)
print("Test performance:")
log_reg_model_test_perf
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.876111 | 0.470769 | 0.659483 | 0.549372 |
# creating confusion matrix
confusion_matrix_sklearn(lr, X_val, y_val)
bag_cls = BaggingClassifier(random_state=1)
bag_cls.fit(X_train, y_train)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_bfr = cross_val_score(
estimator=bag_cls, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
bag_cls_model_train_perf = model_performance_classification_sklearn(
bag_cls, X_train, y_train
)
print("Training performance:")
bag_cls_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.997202 | 0.984631 | 0.997923 | 0.991233 |
# Calculating different metrics on validation set
bag_cls_model_val_perf = model_performance_classification_sklearn(bag_cls, X_val, y_val)
print("Validation performance:")
bag_cls_model_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.95311 | 0.806748 | 0.891525 | 0.847021 |
# Calculating different metrics on validation set
bag_cls_model_test_perf = model_performance_classification_sklearn(bag_cls, X_test, y_test)
print("Test performance:")
bag_cls_model_test_perf
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.958045 | 0.861538 | 0.875 | 0.868217 |
dtree_cls = DecisionTreeClassifier(random_state=1)
dtree_cls.fit(X_train, y_train)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_bfr = cross_val_score(
estimator=dtree_cls, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
dtree_cls_model_train_perf = model_performance_classification_sklearn(
dtree_cls, X_train, y_train
)
print("Training performance:")
dtree_cls_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Calculating different metrics on validation set
dtree_cls_model_val_perf = model_performance_classification_sklearn(dtree_cls, X_val, y_val)
print("Validation performance:")
dtree_cls_model_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.937808 | 0.809816 | 0.804878 | 0.807339 |
# Calculating different metrics on test set
dtree_cls_model_test_perf = model_performance_classification_sklearn(dtree_cls, X_test, y_test)
print("Test performance:")
dtree_cls_model_test_perf
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.933366 | 0.821538 | 0.776163 | 0.798206 |
ada_cls = AdaBoostClassifier(random_state=1)
ada_cls.fit(X_train, y_train)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_bfr = cross_val_score(
estimator=ada_cls, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
ada_cls_model_train_perf = model_performance_classification_sklearn(
ada_cls, X_train, y_train
)
print("Training performance:")
ada_cls_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.958519 | 0.838115 | 0.89693 | 0.866525 |
# Calculating different metrics on validation set
ada_cls_model_val_perf = model_performance_classification_sklearn(ada_cls, X_val, y_val)
print("Validation performance:")
ada_cls_model_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.96002 | 0.858896 | 0.888889 | 0.873635 |
# Calculating different metrics on validation set
ada_cls_model_test_perf = model_performance_classification_sklearn(ada_cls, X_test, y_test)
print("Test performance:")
ada_cls_model_test_perf
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.963475 | 0.895385 | 0.879154 | 0.887195 |
gboost_cls = GradientBoostingClassifier(random_state=1)
gboost_cls.fit(X_train, y_train)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_bfr = cross_val_score(
estimator=gboost_cls, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
gboost_cls_model_train_perf = model_performance_classification_sklearn(
gboost_cls, X_train, y_train
)
print("Training performance:")
gboost_cls_model_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.973663 | 0.878074 | 0.954343 | 0.914621 |
# Calculating different metrics on validation set
gboost_cls_model_val_perf = model_performance_classification_sklearn(gboost_cls, X_val, y_val)
print("Validation performance:")
gboost_cls_model_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.967917 | 0.858896 | 0.936455 | 0.896 |
# Calculating different metrics on test set
gboost_cls_model_test_perf = model_performance_classification_sklearn(gboost_cls, X_test, y_test)
print("Test performance:")
gboost_cls_model_test_perf
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.968904 | 0.873846 | 0.928105 | 0.900158 |
random_forest_cls = RandomForestClassifier(random_state=1)
random_forest_cls.fit(X_train, y_train)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_bfr = cross_val_score(
estimator=random_forest_cls, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
random_forest_cls_train_perf = model_performance_classification_sklearn(
random_forest_cls, X_train, y_train
)
print("Training performance:")
random_forest_cls_train_perf
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Calculating different metrics on validation set
random_forest_cls_model_val_perf = model_performance_classification_sklearn(random_forest_cls, X_val, y_val)
print("Validation performance:")
random_forest_cls_model_val_perf
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.958045 | 0.800613 | 0.928826 | 0.859967 |
# Calculating different metrics on test set
random_forest_cls_model_test_perf = model_performance_classification_sklearn(random_forest_cls, X_test, y_test)
print("Test performance:")
random_forest_cls_model_test_perf
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.959526 | 0.812308 | 0.926316 | 0.865574 |
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE(
sampling_strategy=1, k_neighbors=5, random_state=1
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 976 Before Oversampling, counts of label 'No': 5099 After Oversampling, counts of label 'Yes': 5099 After Oversampling, counts of label 'No': 5099 After Oversampling, the shape of train_X: (10198, 30) After Oversampling, the shape of train_y: (10198,)
log_reg_over = LogisticRegression(random_state=1)
# Training the basic logistic regression model with training set
log_reg_over.fit(X_train_over, y_train_over)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_over = cross_val_score(
estimator=log_reg_over, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
log_reg_over_train_perf = model_performance_classification_sklearn(
log_reg_over, X_train_over, y_train_over
)
# Calculating different metrics on train set
log_reg_over_val_perf = model_performance_classification_sklearn(
log_reg_over, X_val, y_val
)
# Calculating different metrics on train set
log_reg_over_test_perf = model_performance_classification_sklearn(
log_reg_over, X_test, y_test
)
bag_cls_over = BaggingClassifier(random_state=1)
# Training the basic logistic regression model with training set
bag_cls_over.fit(X_train_over, y_train_over)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_over = cross_val_score(
estimator=bag_cls_over, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
bag_cls_over_train_perf = model_performance_classification_sklearn(
bag_cls_over, X_train_over, y_train_over
)
# Calculating different metrics on train set
bag_cls_over_val_perf = model_performance_classification_sklearn(
bag_cls_over, X_val, y_val
)
# Calculating different metrics on train set
bag_cls_over_test_perf = model_performance_classification_sklearn(
bag_cls_over, X_test, y_test
)
ada_cls_over = AdaBoostClassifier(random_state=1)
# Training the basic logistic regression model with training set
ada_cls_over.fit(X_train_over, y_train_over)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
# Setting number of splits equal to 5
cv_result_over = cross_val_score(
estimator=ada_cls_over, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
ada_cls_over_train_perf = model_performance_classification_sklearn(
ada_cls_over, X_train_over, y_train_over
)
# Calculating different metrics on train set
ada_cls_over_val_perf = model_performance_classification_sklearn(
ada_cls_over, X_val, y_val
)
# Calculating different metrics on train set
ada_cls_over_test_perf = model_performance_classification_sklearn(
ada_cls_over, X_test, y_test
)
dtree_cls_over = DecisionTreeClassifier(random_state=1)
# Training the basic logistic regression model with training set
dtree_cls_over.fit(X_train_over, y_train_over)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_over = cross_val_score(
estimator=dtree_cls_over, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
dtree_cls_over_train_perf = model_performance_classification_sklearn(
dtree_cls_over, X_train_over, y_train_over
)
# Calculating different metrics on train set
dtree_cls_over_val_perf = model_performance_classification_sklearn(
dtree_cls_over, X_val, y_val
)
# Calculating different metrics on train set
dtree_cls_over_test_perf = model_performance_classification_sklearn(
dtree_cls_over, X_test, y_test
)
# dtree_cls_over = DecisionTreeClassifier(random_state=1)
gboost_cls_over = GradientBoostingClassifier(random_state=1)
# Training the basic logistic regression model with training set
gboost_cls_over.fit(X_train_over, y_train_over)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_over = cross_val_score(
estimator=gboost_cls_over, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
gboost_cls_over_train_perf = model_performance_classification_sklearn(
gboost_cls_over, X_train_over, y_train_over
)
# Calculating different metrics on train set
gboost_cls_over_val_perf = model_performance_classification_sklearn(
gboost_cls_over, X_val, y_val
)
# Calculating different metrics on train set
gboost_cls_over_test_perf = model_performance_classification_sklearn(
gboost_cls_over, X_test, y_test
)
random_forest_cls_over = RandomForestClassifier(random_state=1)
# Training the basic logistic regression model with training set
random_forest_cls_over.fit(X_train_over, y_train_over)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result_over = cross_val_score(
estimator=random_forest_cls_over, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
random_forest_cls_over_train_perf = model_performance_classification_sklearn(
random_forest_cls_over, X_train_over, y_train_over
)
# Calculating different metrics on train set
random_forest_cls_over_val_perf = model_performance_classification_sklearn(
random_forest_cls_over, X_val, y_val
)
# Calculating different metrics on train set
random_forest_cls_over_test_perf = model_performance_classification_sklearn(
random_forest_cls_over, X_test, y_test
)
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
ada_cls_under = AdaBoostClassifier(random_state=1)
gboost_cls_under = GradientBoostingClassifier(random_state=1)
random_forest_cls_under = RandomForestClassifier(random_state=1)
dtree_cls_under = DecisionTreeClassifier(random_state=1)
bag_cls_under = BaggingClassifier(random_state=1)
log_reg_under = LogisticRegression(random_state=1)
print("Adaboost:")
ada_cls_under.fit(X_train_un, y_train_un)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
cv_result_under = cross_val_score(
estimator=ada_cls_under, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
ada_cls_under_train_perf = model_performance_classification_sklearn(
ada_cls_under, X_train_un, y_train_un
)
print("Training performance:")
ada_cls_under_train_perf
# Calculating different metrics on validation set
ada_cls_under_val_perf = model_performance_classification_sklearn(
ada_cls_under, X_val, y_val
)
print("Validation performance:")
ada_cls_under_val_perf
# Calculating different metrics on validation set
ada_cls_under_test_perf = model_performance_classification_sklearn(
ada_cls_under, X_test, y_test
)
print("Test performance:")
ada_cls_under_test_perf
# gboost_cls_under
print("Gradient Boost:")
gboost_cls_under.fit(X_train_un, y_train_un)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
cv_result_under = cross_val_score(
estimator=gboost_cls_under, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
gboost_cls_under_train_perf = model_performance_classification_sklearn(
gboost_cls_under, X_train_un, y_train_un
)
print("Training performance:")
gboost_cls_under_train_perf
# Calculating different metrics on validation set
gboost_cls_under_val_perf = model_performance_classification_sklearn(
gboost_cls_under, X_val, y_val
)
print("Validation performance:")
gboost_cls_under_val_perf
# Calculating different metrics on validation set
gboost_cls_under_test_perf = model_performance_classification_sklearn(
gboost_cls_under, X_test, y_test
)
print("Test performance:")
gboost_cls_under_test_perf
print("Random Forest:")
random_forest_cls_under.fit(X_train_un, y_train_un)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
cv_result_under = cross_val_score(
estimator=random_forest_cls_under, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
random_forest_cls_under_train_perf = model_performance_classification_sklearn(
random_forest_cls_under, X_train_un, y_train_un
)
print("Training performance:")
random_forest_cls_under_train_perf
# Calculating different metrics on validation set
random_forest_cls_under_val_perf = model_performance_classification_sklearn(
random_forest_cls_under, X_val, y_val
)
print("Validation performance:")
random_forest_cls_under_val_perf
# Calculating different metrics on validation set
random_forest_cls_under_test_perf = model_performance_classification_sklearn(
random_forest_cls_under, X_test, y_test
)
print("Test performance:")
random_forest_cls_under_test_perf
print("dtree_cls_under:")
dtree_cls_under.fit(X_train_un, y_train_un)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
cv_result_under = cross_val_score(
estimator=dtree_cls_under, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
dtree_cls_under_train_perf = model_performance_classification_sklearn(
dtree_cls_under, X_train_un, y_train_un
)
print("Training performance:")
dtree_cls_under_train_perf
# Calculating different metrics on validation set
dtree_cls_under_val_perf = model_performance_classification_sklearn(
dtree_cls_under, X_val, y_val
)
print("Validation performance:")
ada_cls_under_val_perf
# Calculating different metrics on validation set
dtree_cls_under_test_perf = model_performance_classification_sklearn(
dtree_cls_under, X_test, y_test
)
print("Test performance:\n",dtree_cls_under_test_perf)
# gboost_cls_under
print("Bagging bag_cls_under:")
bag_cls_under.fit(X_train_un, y_train_un)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
cv_result_under = cross_val_score(
estimator=bag_cls_under, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
bag_cls_under_train_perf = model_performance_classification_sklearn(
bag_cls_under, X_train_un, y_train_un
)
print("Training performance:\n",bag_cls_under_train_perf)
# Calculating different metrics on validation set
bag_cls_under_val_perf = model_performance_classification_sklearn(
bag_cls_under, X_val, y_val
)
print("Validation performance:\n",bag_cls_under_val_perf)
# Calculating different metrics on validation set
bag_cls_under_test_perf = model_performance_classification_sklearn(
bag_cls_under, X_test, y_test
)
print("Test performance:\n",bag_cls_under_test_perf)
print("Logistic Regression:")
log_reg_under.fit(X_train_un, y_train_un)
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
)
cv_result_under = cross_val_score(
estimator=log_reg_under, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
# Calculating different metrics on train set
log_reg_under_train_perf = model_performance_classification_sklearn(
log_reg_under, X_train_un, y_train_un
)
print("Training performance:\n",log_reg_under_train_perf)
# Calculating different metrics on validation set
log_reg_under_val_perf = model_performance_classification_sklearn(
log_reg_under, X_val, y_val
)
print("Validation performance:\n",log_reg_under_val_perf)
# Calculating different metrics on validation set
log_reg_under_test_perf = model_performance_classification_sklearn(
log_reg_under, X_test, y_test
)
print("Test performance:\n",log_reg_under_test_perf)
Adaboost:
Training performance:
Validation performance:
Test performance:
Gradient Boost:
Training performance:
Validation performance:
Test performance:
Random Forest:
Training performance:
Validation performance:
Test performance:
dtree_cls_under:
Training performance:
Validation performance:
Test performance:
Accuracy Recall Precision F1
0 0.888944 0.932308 0.598814 0.729242
Bagging bag_cls_under:
Training performance:
Accuracy Recall Precision F1
0 0.994877 0.991803 0.997938 0.994861
Validation performance:
Accuracy Recall Precision F1
0 0.924975 0.929448 0.701389 0.799472
Test performance:
Accuracy Recall Precision F1
0 0.91461 0.950769 0.66309 0.78129
Logistic Regression:
Training performance:
Accuracy Recall Precision F1
0 0.815574 0.822746 0.811111 0.816887
Validation performance:
Accuracy Recall Precision F1
0 0.8154 0.828221 0.459184 0.59081
Test performance:
Accuracy Recall Precision F1
0 0.807996 0.846154 0.447883 0.585729
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_over_train_perf.T,
log_reg_under_train_perf.T,
bag_cls_model_train_perf.T,
bag_cls_over_train_perf.T,
bag_cls_under_train_perf.T,
ada_cls_model_train_perf.T,
ada_cls_over_train_perf.T,
ada_cls_under_train_perf.T,
dtree_cls_model_train_perf.T,
dtree_cls_over_train_perf.T,
dtree_cls_under_train_perf.T,
gboost_cls_model_train_perf.T,
gboost_cls_over_train_perf.T,
gboost_cls_under_train_perf.T,
random_forest_cls_train_perf.T,
random_forest_cls_over_train_perf.T,
random_forest_cls_under_train_perf.T
# log_reg_over_train_perf.T,
# log_reg_reg_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression",
"Logistic Regression oversampled",
"Logistic Regression undersampled",
"Bagging",
"Bagging oversampled",
"Bagging undersampled",
"Ada Boost",
"Ada Boost oversampled",
"Ada Boost undersampled",
"Decision Tree",
"Decision Tree oversampled",
"Decision Tree undersampled",
"Gradient Boost",
"Gradient Boost oversampled",
"Gradient Boost undersampled",
"Random Forest",
"Random Forest oversampled",
"Random Forest undersampled",
]
print("Training performance comparison:")
models_train_comp_df.T
Training performance comparison:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Logistic Regression | 0.875062 | 0.448770 | 0.664643 | 0.535780 |
| Logistic Regression oversampled | 0.831438 | 0.835066 | 0.829050 | 0.832047 |
| Logistic Regression undersampled | 0.815574 | 0.822746 | 0.811111 | 0.816887 |
| Bagging | 0.997202 | 0.984631 | 0.997923 | 0.991233 |
| Bagging oversampled | 0.998137 | 0.997647 | 0.998626 | 0.998136 |
| Bagging undersampled | 0.994877 | 0.991803 | 0.997938 | 0.994861 |
| Ada Boost | 0.958519 | 0.838115 | 0.896930 | 0.866525 |
| Ada Boost oversampled | 0.960482 | 0.964111 | 0.957165 | 0.960625 |
| Ada Boost undersampled | 0.949795 | 0.952869 | 0.947047 | 0.949949 |
| Decision Tree | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Decision Tree oversampled | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Decision Tree undersampled | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Gradient Boost | 0.973663 | 0.878074 | 0.954343 | 0.914621 |
| Gradient Boost oversampled | 0.976956 | 0.980977 | 0.973152 | 0.977049 |
| Gradient Boost undersampled | 0.973361 | 0.979508 | 0.967611 | 0.973523 |
| Random Forest | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Random Forest oversampled | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Random Forest undersampled | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
# training performance comparison
models_val_comp_df = pd.concat(
[
log_reg_model_val_perf.T,
log_reg_over_val_perf.T,
log_reg_under_val_perf.T,
bag_cls_model_val_perf.T,
bag_cls_over_val_perf.T,
bag_cls_under_val_perf.T,
ada_cls_model_val_perf.T,
ada_cls_over_val_perf.T,
ada_cls_under_val_perf.T,
dtree_cls_model_val_perf.T,
dtree_cls_over_val_perf.T,
dtree_cls_under_val_perf.T,
gboost_cls_model_val_perf.T,
gboost_cls_over_val_perf.T,
gboost_cls_under_val_perf.T,
random_forest_cls_model_val_perf.T,
random_forest_cls_over_val_perf.T,
random_forest_cls_under_val_perf.T
],
axis=1,
)
models_val_comp_df.columns = [
"Logistic Regression",
"Logistic Regression oversampled",
"Logistic Regression undersampled",
"Bagging",
"Bagging oversampled",
"Bagging undersampled",
"Ada Boost",
"Ada Boost oversampled",
"Ada Boost undersampled",
"Decision Tree",
"Decision Tree oversampled",
"Decision Tree undersampled",
"Gradient Boost",
"Gradient Boost oversampled",
"Gradient Boost undersampled",
"Random Forest",
"Random Forest oversampled",
"Random Forest undersampled",
]
print("Validation performance comparison:")
models_val_comp_df.T
Validation performance comparison:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Logistic Regression | 0.884008 | 0.503067 | 0.691983 | 0.582593 |
| Logistic Regression oversampled | 0.824778 | 0.803681 | 0.473779 | 0.596132 |
| Logistic Regression undersampled | 0.815400 | 0.828221 | 0.459184 | 0.590810 |
| Bagging | 0.953110 | 0.806748 | 0.891525 | 0.847021 |
| Bagging oversampled | 0.939783 | 0.840491 | 0.796512 | 0.817910 |
| Bagging undersampled | 0.924975 | 0.929448 | 0.701389 | 0.799472 |
| Ada Boost | 0.960020 | 0.858896 | 0.888889 | 0.873635 |
| Ada Boost oversampled | 0.940770 | 0.858896 | 0.790960 | 0.823529 |
| Ada Boost undersampled | 0.928924 | 0.960123 | 0.704955 | 0.812987 |
| Decision Tree | 0.937808 | 0.809816 | 0.804878 | 0.807339 |
| Decision Tree oversampled | 0.927937 | 0.828221 | 0.750000 | 0.787172 |
| Decision Tree undersampled | 0.892399 | 0.904908 | 0.612033 | 0.730198 |
| Gradient Boost | 0.967917 | 0.858896 | 0.936455 | 0.896000 |
| Gradient Boost oversampled | 0.954097 | 0.883436 | 0.839650 | 0.860987 |
| Gradient Boost undersampled | 0.937808 | 0.960123 | 0.734742 | 0.832447 |
| Random Forest | 0.958045 | 0.800613 | 0.928826 | 0.859967 |
| Random Forest oversampled | 0.952122 | 0.843558 | 0.856698 | 0.850077 |
| Random Forest undersampled | 0.930405 | 0.929448 | 0.719715 | 0.811245 |
# Testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_over_test_perf.T,
log_reg_under_test_perf.T,
bag_cls_model_test_perf.T,
bag_cls_over_test_perf.T,
bag_cls_under_test_perf.T,
ada_cls_model_test_perf.T,
ada_cls_over_test_perf.T,
ada_cls_under_test_perf.T,
dtree_cls_model_test_perf.T,
dtree_cls_over_test_perf.T,
dtree_cls_under_test_perf.T,
gboost_cls_model_test_perf.T,
gboost_cls_over_test_perf.T,
gboost_cls_under_test_perf.T,
random_forest_cls_model_test_perf.T,
random_forest_cls_over_test_perf.T,
random_forest_cls_under_test_perf.T
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression",
"Logistic Regression oversampled",
"Logistic Regression undersampled",
"Bagging",
"Bagging oversampled",
"Bagging undersampled",
"Ada Boost",
"Ada Boost oversampled",
"Ada Boost undersampled",
"Decision Tree",
"Decision Tree oversampled",
"Decision Tree undersampled",
"Gradient Boost",
"Gradient Boost oversampled",
"Gradient Boost undersampled",
"Random Forest",
"Random Forest oversampled",
"Random Forest undersampled",
]
print("Test performance comparison:")
models_test_comp_df.T
Test performance comparison:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Logistic Regression | 0.876111 | 0.470769 | 0.659483 | 0.549372 |
| Logistic Regression oversampled | 0.814906 | 0.793846 | 0.455830 | 0.579125 |
| Logistic Regression undersampled | 0.807996 | 0.846154 | 0.447883 | 0.585729 |
| Bagging | 0.958045 | 0.861538 | 0.875000 | 0.868217 |
| Bagging oversampled | 0.949161 | 0.889231 | 0.811798 | 0.848752 |
| Bagging undersampled | 0.914610 | 0.950769 | 0.663090 | 0.781290 |
| Ada Boost | 0.963475 | 0.895385 | 0.879154 | 0.887195 |
| Ada Boost oversampled | 0.943238 | 0.901538 | 0.779255 | 0.835949 |
| Ada Boost undersampled | 0.927937 | 0.960000 | 0.701124 | 0.810390 |
| Decision Tree | 0.933366 | 0.821538 | 0.776163 | 0.798206 |
| Decision Tree oversampled | 0.919052 | 0.843077 | 0.708010 | 0.769663 |
| Decision Tree undersampled | 0.888944 | 0.932308 | 0.598814 | 0.729242 |
| Gradient Boost | 0.968904 | 0.873846 | 0.928105 | 0.900158 |
| Gradient Boost oversampled | 0.960513 | 0.932308 | 0.839335 | 0.883382 |
| Gradient Boost undersampled | 0.933366 | 0.969231 | 0.715909 | 0.823529 |
| Random Forest | 0.959526 | 0.812308 | 0.926316 | 0.865574 |
| Random Forest oversampled | 0.960020 | 0.880000 | 0.871951 | 0.875957 |
| Random Forest undersampled | 0.926456 | 0.953846 | 0.698198 | 0.806242 |
# defining model
model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 30, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.8472893772893773:
# building model with best parameters
adb_tuned = AdaBoostClassifier(
n_estimators=30,
learning_rate=1,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
)
# Fit the model on training data
adb_tuned.fit(X_train, y_train)
# Calculating different metrics on train set
adb_tuned_train_df = model_performance_classification_sklearn(
adb_tuned, X_train, y_train
)
print("Training performance:")
adb_tuned_train_df
# Calculating different metrics on train set
adb_tuned_val_df = model_performance_classification_sklearn(
adb_tuned, X_val, y_val
)
print("Validation performance:")
adb_tuned_val_df
# Calculating different metrics on train set
adb_tuned_test_df = model_performance_classification_sklearn(
adb_tuned, X_test, y_test
)
print("Test performance:")
adb_tuned_test_df
Training performance: Validation performance: Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.970879 | 0.916923 | 0.90303 | 0.909924 |
importances = adb_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
print(feature_names)
print(np.argsort(importances))
['Customer_Age', 'Gender', 'Dependent_count', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'] [14 27 26 25 24 23 22 19 17 16 15 28 29 21 1 2 18 13 20 6 4 5 0 3 7 8 12 9 11 10]
# Gradient Boost
# gbc_tuned = GradientBoostingClassifier()
parameters = {
"n_estimators": [100,150,200,250],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the random search
grid_obj = RandomizedSearchCV(gboost_cls, parameters,n_iter=30, scoring=acc_scorer,cv=5, random_state = 1, n_jobs = -1, verbose = 2)
# using n_iter = 30, so randomized search will try 30 different combinations of hyperparameters
# by default, n_iter = 10
grid_obj = grid_obj.fit(X_train, y_train)
# Print the best combination of parameters
grid_obj.best_params_
Fitting 5 folds for each of 30 candidates, totalling 150 fits
{'subsample': 0.9, 'n_estimators': 250, 'max_features': 0.7}
gbc_tuned = GradientBoostingClassifier(random_state=1)
# Grid of parameters to choose from
## add from article
parameters = {
"n_estimators": [250],
"subsample":[0.9],
"max_features":[0.7]
}
# Fit the model on training data
gbc_tuned.fit(X_train, y_train)
# Calculating different metrics on train set
gbc_tuned_train_df = model_performance_classification_sklearn(
gbc_tuned, X_train, y_train
)
print("Training performance:")
gbc_tuned_train_df
# Calculating different metrics on train set
gbc_tuned_val_df = model_performance_classification_sklearn(
gbc_tuned, X_val, y_val
)
print("Validation performance:")
gbc_tuned_val_df
# Calculating different metrics on train set
gbc_tuned_test_df = model_performance_classification_sklearn(
gbc_tuned, X_test, y_test
)
print("Test performance:")
gbc_tuned_test_df
Training performance: Validation performance: Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.968904 | 0.873846 | 0.928105 | 0.900158 |
# Gradient Boost
# gbc_tuned = GradientBoostingClassifier()
parameters = {
"n_estimators": [100,150,200,250],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the random search
grid_obj = RandomizedSearchCV(gboost_cls_over, parameters,n_iter=30, scoring=acc_scorer,cv=5, random_state = 1, n_jobs = -1, verbose = 2)
# using n_iter = 30, so randomized search will try 30 different combinations of hyperparameters
# by default, n_iter = 10
grid_obj = grid_obj.fit(X_train, y_train)
# Print the best combination of parameters
grid_obj.best_params_
Fitting 5 folds for each of 30 candidates, totalling 150 fits
{'subsample': 0.9, 'n_estimators': 250, 'max_features': 0.7}
%%time
# Choose the type of classifier.
# rf_tuned = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {"n_estimators": [150,200,250],
"min_samples_leaf": np.arange(5, 10),
"max_features": np.arange(0.2, 0.7, 0.1),
"max_samples": np.arange(0.3, 0.7, 0.1),
"max_depth":np.arange(3,4,5),
"class_weight" : ['balanced', 'balanced_subsample'],
"min_impurity_decrease":[0.001, 0.002, 0.003]
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the random search
grid_obj = RandomizedSearchCV(random_forest_cls_over, parameters,n_iter=30, scoring=acc_scorer,cv=5, random_state = 1, n_jobs = -1, verbose = 2)
# using n_iter = 30, so randomized search will try 30 different combinations of hyperparameters
# by default, n_iter = 10
grid_obj = grid_obj.fit(X_train, y_train)
# Print the best combination of parameters
grid_obj.best_params_
Fitting 5 folds for each of 30 candidates, totalling 150 fits CPU times: user 1.14 s, sys: 198 ms, total: 1.34 s Wall time: 15.8 s
{'n_estimators': 150,
'min_samples_leaf': 9,
'min_impurity_decrease': 0.003,
'max_samples': 0.4,
'max_features': 0.6000000000000001,
'max_depth': 3,
'class_weight': 'balanced_subsample'}
# Set the clf to the best combination of parameters
rf_tuned = RandomForestClassifier(
class_weight="balanced_subsample",
max_features=0.6000000000000001,
max_samples=0.4,
min_samples_leaf=9,
n_estimators=150,
random_state=1,
max_depth=3,
min_impurity_decrease=0.003,
)
# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
# Calculating different metrics on train set
rf_tuned_train_df = model_performance_classification_sklearn(
rf_tuned, X_train, y_train
)
print("Training performance:")
adb_tuned_train_df
# Calculating different metrics on train set
rf_tuned_val_df = model_performance_classification_sklearn(
rf_tuned, X_val, y_val
)
print("Validation performance:")
rf_tuned_val_df
# Calculating different metrics on train set
rf_tuned_test_df = model_performance_classification_sklearn(
rf_tuned, X_test, y_test
)
print("Test performance:")
rf_tuned_test_df
Training performance: Validation performance: Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.863771 | 0.910769 | 0.54512 | 0.682028 |
# Testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_over_test_perf.T,
log_reg_under_test_perf.T,
bag_cls_model_test_perf.T,
bag_cls_over_test_perf.T,
bag_cls_under_test_perf.T,
ada_cls_model_test_perf.T,
ada_cls_over_test_perf.T,
ada_cls_under_test_perf.T,
dtree_cls_model_test_perf.T,
dtree_cls_over_test_perf.T,
dtree_cls_under_test_perf.T,
gboost_cls_model_test_perf.T,
gboost_cls_over_test_perf.T,
gboost_cls_under_test_perf.T,
random_forest_cls_model_test_perf.T,
random_forest_cls_over_test_perf.T,
random_forest_cls_under_test_perf.T
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression",
"Logistic Regression oversampled",
"Logistic Regression undersampled",
"Bagging",
"Bagging oversampled",
"Bagging undersampled",
"Ada Boost",
"Ada Boost oversampled",
"Ada Boost undersampled",
"Decision Tree",
"Decision Tree oversampled",
"Decision Tree undersampled",
"Gradient Boost",
"Gradient Boost oversampled",
"Gradient Boost undersampled",
"Random Forest",
"Random Forest oversampled",
"Random Forest undersampled",
]
print("Test performance comparison:")
models_test_comp_df.T
Test performance comparison:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| Logistic Regression | 0.876111 | 0.470769 | 0.659483 | 0.549372 |
| Logistic Regression oversampled | 0.814906 | 0.793846 | 0.455830 | 0.579125 |
| Logistic Regression undersampled | 0.807996 | 0.846154 | 0.447883 | 0.585729 |
| Bagging | 0.958045 | 0.861538 | 0.875000 | 0.868217 |
| Bagging oversampled | 0.949161 | 0.889231 | 0.811798 | 0.848752 |
| Bagging undersampled | 0.914610 | 0.950769 | 0.663090 | 0.781290 |
| Ada Boost | 0.963475 | 0.895385 | 0.879154 | 0.887195 |
| Ada Boost oversampled | 0.943238 | 0.901538 | 0.779255 | 0.835949 |
| Ada Boost undersampled | 0.927937 | 0.960000 | 0.701124 | 0.810390 |
| Decision Tree | 0.933366 | 0.821538 | 0.776163 | 0.798206 |
| Decision Tree oversampled | 0.919052 | 0.843077 | 0.708010 | 0.769663 |
| Decision Tree undersampled | 0.888944 | 0.932308 | 0.598814 | 0.729242 |
| Gradient Boost | 0.968904 | 0.873846 | 0.928105 | 0.900158 |
| Gradient Boost oversampled | 0.960513 | 0.932308 | 0.839335 | 0.883382 |
| Gradient Boost undersampled | 0.933366 | 0.969231 | 0.715909 | 0.823529 |
| Random Forest | 0.959526 | 0.812308 | 0.926316 | 0.865574 |
| Random Forest oversampled | 0.960020 | 0.880000 | 0.871951 | 0.875957 |
| Random Forest undersampled | 0.926456 | 0.953846 | 0.698198 | 0.806242 |
## Ada Boost gives over best performance
adb_tuned_test_df
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.970879 | 0.916923 | 0.90303 | 0.909924 |
[CV] END ..max_features=0.9, n_estimators=250, subsample=0.8; total time= 2.9s [CV] END ..max_features=0.7, n_estimators=150, subsample=0.8; total time= 1.4s [CV] END ......max_features=1, n_estimators=200, subsample=1; total time= 0.4s [CV] END ......max_features=1, n_estimators=200, subsample=1; total time= 0.4s [CV] END ..max_features=0.9, n_estimators=150, subsample=0.8; total time= 1.8s [CV] END ..max_features=0.8, n_estimators=250, subsample=0.8; total time= 2.8s [CV] END ..max_features=0.8, n_estimators=200, subsample=0.9; total time= 2.6s [CV] END ..max_features=0.8, n_estimators=250, subsample=0.9; total time= 3.2s [CV] END ....max_features=1, n_estimators=250, subsample=0.9; total time= 0.6s [CV] END ....max_features=0.9, n_estimators=150, subsample=1; total time= 2.2s [CV] END ..max_features=0.7, n_estimators=250, subsample=0.9; total time= 2.8s [CV] END ..max_features=0.9, n_estimators=250, subsample=0.8; total time= 2.9s [CV] END ..max_features=0.7, n_estimators=150, subsample=0.8; total time= 1.4s [CV] END ....max_features=0.9, n_estimators=100, subsample=1; total time= 1.4s [CV] END ..max_features=0.9, n_estimators=150, subsample=0.9; total time= 2.0s [CV] END ..max_features=0.8, n_estimators=200, subsample=0.9; total time= 2.6s [CV] END ....max_features=1, n_estimators=150, subsample=0.9; total time= 0.4s [CV] END ....max_features=0.9, n_estimators=250, subsample=1; total time= 3.7s [CV] END ..max_features=0.8, n_estimators=100, subsample=0.9; total time= 1.3s [CV] END ....max_features=1, n_estimators=250, subsample=0.9; total time= 0.7s [CV] END ....max_features=0.9, n_estimators=150, subsample=1; total time= 2.2s [CV] END ..max_features=0.7, n_estimators=250, subsample=0.9; total time= 2.8s [CV] END ....max_features=0.8, n_estimators=250, subsample=1; total time= 3.0s [CV] END ..max_features=0.9, n_estimators=200, subsample=0.9; total time= 2.5s [CV] END ..max_features=0.9, n_estimators=150, subsample=0.8; total time= 1.8s [CV] END ..max_features=0.9, n_estimators=250, subsample=0.9; total time= 3.5s [CV] END ..max_features=0.9, n_estimators=100, subsample=0.8; total time= 1.3s [CV] END ..max_features=0.9, n_estimators=100, subsample=0.8; total time= 1.3s [CV] END ..max_features=0.8, n_estimators=100, subsample=0.9; total time= 1.3s [CV] END ....max_features=0.8, n_estimators=150, subsample=1; total time= 2.0s [CV] END ....max_features=0.9, n_estimators=150, subsample=1; total time= 2.3s [CV] END ....max_features=1, n_estimators=200, subsample=0.8; total time= 0.6s [CV] END ..max_features=0.8, n_estimators=200, subsample=0.8; total time= 2.2s [CV] END ....max_features=0.8, n_estimators=250, subsample=1; total time= 3.1s [CV] END ..max_features=0.9, n_estimators=200, subsample=0.9; total time= 2.5s [CV] END ......max_features=1, n_estimators=250, subsample=1; total time= 0.5s [CV] END ......max_features=1, n_estimators=250, subsample=1; total time= 0.5s [CV] END ......max_features=1, n_estimators=250, subsample=1; total time= 0.5s [CV] END ..max_features=0.9, n_estimators=250, subsample=0.9; total time= 3.5s [CV] END ....max_features=0.9, n_estimators=250, subsample=1; total time= 3.7s [CV] END ....max_features=1, n_estimators=150, subsample=0.8; total time= 0.4s [CV] END ....max_features=1, n_estimators=150, subsample=0.8; total time= 0.4s [CV] END ....max_features=0.8, n_estimators=150, subsample=1; total time= 2.0s [CV] END ....max_features=0.8, n_estimators=100, subsample=1; total time= 1.4s [CV] END ....max_features=1, n_estimators=200, subsample=0.8; total time= 0.6s [CV] END ....max_features=1, n_estimators=200, subsample=0.8; total time= 0.5s [CV] END ..max_features=0.8, n_estimators=200, subsample=0.8; total time= 2.2s [CV] END ......max_features=1, n_estimators=100, subsample=1; total time= 0.2s [CV] END ......max_features=1, n_estimators=100, subsample=1; total time= 0.2s [CV] END ......max_features=1, n_estimators=100, subsample=1; total time= 0.2s [CV] END ....max_features=1, n_estimators=100, subsample=0.8; total time= 0.2s [CV] END ....max_features=1, n_estimators=100, subsample=0.8; total time= 0.2s [CV] END ....max_features=0.7, n_estimators=100, subsample=1; total time= 1.1s [CV] END ....max_features=0.7, n_estimators=100, subsample=1; total time= 1.1s [CV] END ..max_features=0.9, n_estimators=200, subsample=0.9; total time= 2.6s [CV] END ......max_features=1, n_estimators=250, subsample=1; total time= 0.5s [CV] END ......max_features=1, n_estimators=250, subsample=1; total time= 0.5s [CV] END ..max_features=0.8, n_estimators=250, subsample=0.8; total time= 2.8s [CV] END ....max_features=1, n_estimators=150, subsample=0.9; total time= 0.4s [CV] END ....max_features=1, n_estimators=150, subsample=0.9; total time= 0.4s [CV] END ....max_features=0.9, n_estimators=250, subsample=1; total time= 3.7s [CV] END ....max_features=1, n_estimators=150, subsample=0.8; total time= 0.4s [CV] END ....max_features=0.8, n_estimators=150, subsample=1; total time= 1.9s [CV] END ....max_features=0.9, n_estimators=150, subsample=1; total time= 2.3s [CV] END ....max_features=0.9, n_estimators=200, subsample=1; total time= 2.8s [CV] END ....max_features=0.8, n_estimators=250, subsample=1; total time= 3.1s [CV] END ..max_features=0.9, n_estimators=200, subsample=0.9; total time= 2.5s [CV] END ..max_features=0.9, n_estimators=150, subsample=0.9; total time= 1.9s [CV] END ..max_features=0.9, n_estimators=250, subsample=0.9; total time= 3.5s [CV] END ..max_features=0.9, n_estimators=100, subsample=0.8; total time= 1.3s [CV] END ..max_features=0.8, n_estimators=250, subsample=0.9; total time= 3.1s [CV] END ..max_features=0.7, n_estimators=150, subsample=0.9; total time= 1.7s [CV] END ....max_features=0.8, n_estimators=100, subsample=1; total time= 1.4s [CV] END ....max_features=1, n_estimators=200, subsample=0.8; total time= 0.6s [CV] END ....max_features=0.9, n_estimators=200, subsample=1; total time= 2.8s [CV] END ..max_features=0.9, n_estimators=250, subsample=0.8; total time= 2.8s [CV] END ....max_features=0.7, n_estimators=100, subsample=1; total time= 1.1s [CV] END ....max_features=0.9, n_estimators=100, subsample=1; total time= 1.4s [CV] END ..max_features=0.9, n_estimators=150, subsample=0.8; total time= 1.8s [CV] END ..max_features=0.8, n_estimators=250, subsample=0.8; total time= 2.8s [CV] END ..max_features=0.8, n_estimators=200, subsample=0.9; total time= 2.6s [CV] END ..max_features=0.8, n_estimators=250, subsample=0.9; total time= 3.1s [CV] END ..max_features=0.7, n_estimators=150, subsample=0.9; total time= 1.7s [CV] END ....max_features=0.8, n_estimators=100, subsample=1; total time= 1.4s [CV] END ....max_features=1, n_estimators=200, subsample=0.8; total time= 0.6s [CV] END ....max_features=0.9, n_estimators=200, subsample=1; total time= 2.8s [CV] END ..max_features=0.9, n_estimators=250, subsample=0.8; total time= 2.9s [CV] END ..max_features=0.7, n_estimators=150, subsample=0.8; total time= 1.4s [CV] END ....max_features=0.9, n_estimators=100, subsample=1; total time= 1.4s [CV] END ..max_features=0.9, n_estimators=150, subsample=0.9; total time= 1.9s [CV] END ..max_features=0.9, n_estimators=250, subsample=0.9; total time= 3.5s [CV] END ..max_features=0.9, n_estimators=100, subsample=0.8; total time= 1.3s [CV] END ..max_features=0.8, n_estimators=250, subsample=0.9; total time= 3.1s [CV] END ....max_features=1, n_estimators=250, subsample=0.9; total time= 0.7s [CV] END ..max_features=0.7, n_estimators=150, subsample=0.9; total time= 1.7s [CV] END ....max_features=0.8, n_estimators=100, subsample=1; total time= 1.4s [CV] END ....max_features=0.9, n_estimators=200, subsample=1; total time= 2.8s [CV] END ......max_features=1, n_estimators=100, subsample=1; total time= 0.2s [CV] END ......max_features=1, n_estimators=100, subsample=1; total time= 0.2s [CV] END ....max_features=1, n_estimators=100, subsample=0.8; total time= 0.3s [CV] END ....max_features=1, n_estimators=100, subsample=0.8; total time= 0.2s [CV] END ....max_features=1, n_estimators=100, subsample=0.8; total time= 0.2s [CV] END ....max_features=0.7, n_estimators=100, subsample=1; total time= 1.1s [CV] END ....max_features=0.7, n_estimators=100, subsample=1; total time= 1.1s [CV] END ....max_features=0.9, n_estimators=100, subsample=1; total time= 1.3s [CV] END ......max_features=1, n_estimators=200, subsample=1; total time= 0.4s [CV] END ..max_features=0.9, n_estimators=150, subsample=0.8; total time= 1.8s [CV] END ..max_features=0.8, n_estimators=250, subsample=0.8; total time= 3.0s [CV] END ....max_features=1, n_estimators=150, subsample=0.9; total time= 0.4s [CV] END ....max_features=0.9, n_estimators=250, subsample=1; total time= 3.7s [CV] END ....max_features=1, n_estimators=150, subsample=0.8; total time= 0.4s [CV] END ....max_features=1, n_estimators=150, subsample=0.8; total time= 0.4s [CV] END ....max_features=0.8, n_estimators=150, subsample=1; total time= 2.0s [CV] END ....max_features=0.9, n_estimators=150, subsample=1; total time= 2.3s [CV] END ....max_features=0.9, n_estimators=200, subsample=1; total time= 2.8s [CV] END ..max_features=0.9, n_estimators=250, subsample=0.8; total time= 2.8s [CV] END ..max_features=0.7, n_estimators=150, subsample=0.8; total time= 1.4s [CV] END ....max_features=0.9, n_estimators=100, subsample=1; total time= 1.4s [CV] END ..max_features=0.9, n_estimators=150, subsample=0.9; total time= 2.0s [CV] END ..max_features=0.8, n_estimators=200, subsample=0.9; total time= 2.5s [CV] END ....max_features=1, n_estimators=150, subsample=0.9; total time= 0.4s [CV] END ....max_features=0.9, n_estimators=250, subsample=1; total time= 3.7s [CV] END ..max_features=0.8, n_estimators=100, subsample=0.9; total time= 1.3s [CV] END ....max_features=1, n_estimators=250, subsample=0.9; total time= 0.7s [CV] END ..max_features=0.7, n_estimators=150, subsample=0.9; total time= 1.7s [CV] END ..max_features=0.7, n_estimators=250, subsample=0.9; total time= 2.8s [CV] END ..max_features=0.8, n_estimators=200, subsample=0.8; total time= 1.9s [CV] END ....max_features=0.8, n_estimators=250, subsample=1; total time= 3.0s [CV] END ..max_features=0.9, n_estimators=200, subsample=0.9; total time= 2.6s [CV] END ..max_features=0.9, n_estimators=150, subsample=0.9; total time= 2.0s [CV] END ..max_features=0.9, n_estimators=250, subsample=0.9; total time= 3.5s [CV] END ..max_features=0.9, n_estimators=100, subsample=0.8; total time= 1.3s [CV] END ..max_features=0.8, n_estimators=250, subsample=0.9; total time= 3.1s [CV] END ....max_features=1, n_estimators=250, subsample=0.9; total time= 0.6s [CV] END ..max_features=0.7, n_estimators=150, subsample=0.9; total time= 1.7s [CV] END ..max_features=0.7, n_estimators=250, subsample=0.9; total time= 2.9s [CV] END ..max_features=0.8, n_estimators=200, subsample=0.8; total time= 1.9s [CV] END ....max_features=0.8, n_estimators=250, subsample=1; total time= 3.0s [CV] END ..max_features=0.7, n_estimators=150, subsample=0.8; total time= 1.4s [CV] END ......max_features=1, n_estimators=200, subsample=1; total time= 0.4s [CV] END ......max_features=1, n_estimators=200, subsample=1; total time= 0.4s [CV] END ..max_features=0.9, n_estimators=150, subsample=0.8; total time= 1.8s [CV] END ..max_features=0.8, n_estimators=250, subsample=0.8; total time= 2.9s [CV] END ..max_features=0.8, n_estimators=200, subsample=0.9; total time= 2.6s [CV] END ..max_features=0.8, n_estimators=100, subsample=0.9; total time= 1.3s [CV] END ..max_features=0.8, n_estimators=100, subsample=0.9; total time= 1.3s [CV] END ....max_features=0.8, n_estimators=150, subsample=1; total time= 2.0s [CV] END ....max_features=0.8, n_estimators=100, subsample=1; total time= 1.4s [CV] END ..max_features=0.7, n_estimators=250, subsample=0.9; total time= 2.8s [CV] END ..max_features=0.8, n_estimators=200, subsample=0.8; total time= 1.7s
## Gradient Boost gives over good performance with Recall.
gbc_tuned_test_df
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.968904 | 0.873846 | 0.928105 | 0.900158 |
# Splitting the data into train and test sets
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
X_train_pipeline, X_test_pipeline, y_train_pipeline, y_test_pipeline = train_test_split(
X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train_pipeline.shape, X_test_pipeline.shape)
(7088, 19) (3039, 19)
# Creating new pipeline with best parameters
model = Pipeline(
steps=[
# ("pre", preprocessor),
(
"AdaBoost",
AdaBoostClassifier(
n_estimators=30,
learning_rate=1,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
),
),
]
)
# Fit the model on training data
model.fit(X_train, y_train)
Pipeline(steps=[('AdaBoost',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
random_state=1),
learning_rate=1, n_estimators=30,
random_state=1))])
# transforming and predicting on test data
model.predict(X_test)
array([0, 1, 0, ..., 0, 0, 0])
feature_names = X_train.columns
importances = adb_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
We have robust set of model options with more focus on recall and also precision balanced from all 18 models for future adjustment.
Most important metric is the recent usage on the account, which correlates heavily to churn.
Ratio of the total transaction amount also play an important role in churn prediction.
People with less than 18 age are dependent on family and might not have income source hence this category can be avoided. People olders than 66 can be given perks or avoided since this is churning age.